Whole game

The goal of this part is a rapid overview of the main tools of data science: importing, tidying, transforming, and visualizing data.


1. Data Visualization

1.1. Introduction

# tidyverse packages
# install.packages('tidyverse')
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

When the tidyverse is loaded, the Conflicts message tells you which tidyverse functions mask functions with the same name in base R or in other loaded packages.
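
A masked function is still reachable through its package prefix. The following is a minimal sketch; the toy data frame `df` and the moving-average weights are made up for illustration and are not part of the original notes.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(x = 1:5)

# After library(dplyr), a bare filter() refers to dplyr::filter(),
# which keeps the rows that satisfy a condition
kept <- dplyr::filter(df, x > 3)
kept

# The masked version from stats is still available with an explicit prefix;
# here it computes a 3-point moving average of 1:5
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Writing the prefix explicitly is also a way to make it obvious to a reader which of the two functions is meant.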

# install.packages('palmerpenguins')
# install.packages('ggthemes')
library(palmerpenguins)
library(ggthemes)

We use the palmerpenguins package, which includes the penguins dataset, and the ggthemes package, which offers a colorblind-safe color palette.




2. First Steps

Do penguins with longer flippers weigh more or less than penguins with shorter flippers? What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear? Does the relationship vary by the species of the penguins? How about by the island where the penguin lives?


The penguins data frame
  • Variable: quantity, quality, or property that you can measure
  • Value: the state of a variable when you measure it
  • Observation: set of measurements made under similar conditions
  • Tabular data: set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.

The tidyverse uses special data frames called tibbles.
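
One way to check whether an object is a tibble is is_tibble() from the tibble package. The sketch below compares penguins with the built-in mtcars data frame, which is used here only as a convenient non-tibble example.

```r
library(tibble)
library(palmerpenguins)

# A tibble carries the classes "tbl_df" and "tbl" on top of "data.frame"
class(penguins)
is_tibble(penguins)

# An ordinary data frame such as mtcars is not a tibble
# until it is converted with as_tibble()
is_tibble(mtcars)
is_tibble(as_tibble(mtcars))
```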

penguins
glimpse(penguins) # similar to str(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…


Ultimate goal

Our ultimate goal is to create a visualization displaying the relationship between the flipper lengths and body masses of these penguins, taking into consideration the species of each penguin.


Creating a ggplot

In ggplot2, we begin a plot with the function ggplot(). It defines a plot object that you then add layers to.
Its main arguments are:

  • data: dataset to use in the graph
  • mapping: defines how variables in our dataset are mapped to visual properties (aesthetics) of our plot
ggplot(data = penguins)

It creates an empty graph that is primed to display the data. We can think of it as an empty canvas onto which we’ll paint the remaining layers of our plot.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

The mapping argument is always defined with the aes() function, and the x and y arguments of aes() specify which variables to map to the x and y axes.

We need to define a geom: the geometrical object that a plot uses to represent data.

  • geom_bar()
  • geom_line()
  • geom_point()
  • geom_boxplot()
  • etc…
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Warning message: two rows have a missing flipper length or body mass, so they cannot be plotted. ggplot2 subscribes to the philosophy that missing values should never silently go missing.
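
To see where the warning comes from, we can count the incomplete rows ourselves. The sketch below also uses the na.rm argument of geom_point(), which drops missing values without a warning; whether suppressing the warning is a good idea depends on the analysis, so treat this as one option rather than a recommendation.

```r
library(ggplot2)
library(palmerpenguins)

# Count rows where either plotted variable is missing;
# this should match the number of rows reported in the warning
n_missing <- sum(
  is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g)
)
n_missing

# na.rm = TRUE removes the incomplete rows silently
p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(na.rm = TRUE)
p
```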

  • Q: What does the relationship between flipper length and body mass look like?
  • A: The relationship appears to be positive, fairly linear, and moderately strong. Penguins with longer flippers are generally larger in terms of their body mass.


Adding aesthetics and layers

It is always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship.
For example, does the relationship between flipper length and body mass differ by species?

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm,
                y = body_mass_g,
                color = species)
) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Scaling: when a categorical variable is mapped to an aesthetic, ggplot2 automatically assigns a unique value of the aesthetic (here, a unique color) to each unique level of the variable (here, each of the three species). ggplot2 also adds a legend that explains which values correspond to which levels.


Let’s add one more layer: a smooth curve displaying the relationship between body mass and flipper length.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm,
                y = body_mass_g,
                color = species)
) +
  geom_point() + 
  geom_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

When aesthetic mappings are defined in ggplot(), at the global level, they are passed down to each of the subsequent geom layers of the plot.
However, each geom function in ggplot2 can also take a mapping argument, which allows for aesthetic mappings at the local level.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm,
                y = body_mass_g) 
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

It’s generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. One remedy is to map species to the shape aesthetic in addition to color.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm,
                y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

We can improve the labels of our plot using the labs() function in a new layer.
Its arguments are:

  • title
  • subtitle
  • x
  • y
  • color and shape: define the label for the legend

Separately, scale_color_colorblind() (from the ggthemes package) switches the color palette to one that is colorblind safe.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm,
                y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = 'lm') +
  labs(
    title = 'Body mass vs. Flipper length',
    subtitle = 'Dimensions for Adelie, Chinstrap, and Gentoo Penguins',
    x = 'Flipper length (mm)',
    y = 'Body mass (g)',
    color = 'Species',
    shape = 'Species'
  ) +
  scale_color_colorblind()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).



1.2. Visualizing distributions

How to visualize the distribution of a variable depends on the type of variable

  • Categorical
  • Numerical


A categorical variable

A variable is categorical if it can only take one of a small set of values. To examine the distribution of a categorical variable, we can use a bar chart.

ggplot(penguins, aes(x = species)) +
  geom_bar()

In bar plots of categorical variables with non-ordered levels, it’s often preferable to reorder the bars based on their frequencies. Doing so requires transforming the variable to a factor (if it is not one already) and then reordering the levels of that factor, here with fct_infreq().

ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar()
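
To see what fct_infreq() changes, we can compare the factor levels before and after. This is a small sketch that only inspects the levels; it does not draw a plot.

```r
library(forcats)
library(palmerpenguins)

# Number of penguins per species
table(penguins$species)

# Original level order (alphabetical)
levels(penguins$species)

# Levels reordered from most to least frequent
reordered <- fct_infreq(penguins$species)
levels(reordered)
```

The bar order in the plot follows the level order, which is why reordering the levels reorders the bars.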


A numerical variable

A variable is numerical or quantitative if it can take on a wide range of numerical values. Numerical variables can be continuous or discrete.
One commonly used visualization for the distribution of a continuous variable is a histogram.

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin.
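Since different binwidths can reveal different patterns, we should explore a variety of binwidths when working with histograms. A binwidth of 20 is too narrow, resulting in too many bars, while a binwidth of 2000 is too wide, resulting in only a few bars.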

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 20)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2000)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
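
A rough way to judge a binwidth before plotting is to divide the range of the variable by the binwidth. The sketch below does this for the three binwidths used above; the result is only an approximation of the number of bars, since ggplot2 chooses the exact bin boundaries itself.

```r
library(palmerpenguins)

# Range of body mass, ignoring missing values
mass_range <- range(penguins$body_mass_g, na.rm = TRUE)

# Approximate number of bins implied by each binwidth
binwidths <- c(20, 200, 2000)
n_bins <- ceiling(diff(mass_range) / binwidths)
setNames(n_bins, binwidths)
```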

An alternative visualization for distributions of numerical variables is a density plot. A density plot is a smoothed-out version of a histogram. It shows fewer details than a histogram but can make it easier to quickly glean the shape of the distribution, particularly with respect to modes and skewness.

ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).




1.3 Visualizing relationships

To visualize a relationship we need to have at least two variables.


A numerical and a categorical variable

To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots.

ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Alternatively, we can make density plots with geom_density().

ggplot(penguins, aes(x = body_mass_g, color = species)) +
  geom_density()
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).

ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(linewidth = 2, alpha = 0.7)
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).


Two categorical variables

We can use a stacked bar plot to visualize the relationship between two categorical variables.

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()

The second plot, created by setting position = 'fill' in the geom, is a relative frequency plot. It is more useful for comparing species distributions across the islands since it is not affected by the unequal numbers of penguins across the islands.

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = 'fill')
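
The proportions that the relative frequency plot displays can also be computed directly. This is a sketch using dplyr; the column name `prop` is chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

island_props <- penguins |>
  count(island, species) |>      # number of penguins per island and species
  group_by(island) |>
  mutate(prop = n / sum(n)) |>   # share of each species within an island
  ungroup()

island_props
```

Within each island the values of `prop` sum to 1, which is what makes the bars in the second plot equally tall.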


Two numerical variables

To visualize the relationship between two numerical variables, we can use a scatterplot, optionally with a smooth curve on top.

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).


Three or more variables

We can incorporate more variables into a plot by mapping them to additional aesthetics.

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island))
## Warning: Removed 2 rows containing missing values (`geom_point()`).

However, adding too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of.
Another way is to split our plot into facets, subplots that each display one subset of the data. To facet our plot by a single variable, use facet_wrap().

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island)
## Warning: Removed 2 rows containing missing values (`geom_point()`).




1.4 Saving plots

ggsave() saves the most recently created plot to disk. If we don’t specify the width and height, they are taken from the dimensions of the current plotting device.

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

# ggsave(filename = 'penguin-plot.png')
# ggsave(filename = 'penguin-plot.pdf')
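
For reproducible output it can help to pass the plot and its dimensions explicitly. The sketch below writes to a temporary directory so that it does not touch the working directory; the file location, the size, and the resolution are arbitrary choices made for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Save to a temporary file with fixed dimensions
out_file <- file.path(tempdir(), "penguin-plot.png")
ggsave(
  filename = out_file,
  plot = p,        # save this plot rather than the last one displayed
  width = 6,
  height = 4,
  units = "in",
  dpi = 300
)

file.exists(out_file)
```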




Data transformation

1. Introduction

It’s rare that we get the data in exactly the form we need to make the graph we want. Often we’ll need to create some new variables or summaries. We may also want to rename variables or reorder observations.

Goals:

  • dplyr package
  • overview of all the key tools for transforming a data frame
  • understand the pipe, an important tool for combining verbs

library(nycflights13)
library(tidyverse)


nycflights13

To explore the basic dplyr verbs, we’re going to use nycflights13::flights.

flights

flights is a tibble, a special type of data frame used by the tidyverse. The most important difference between tibbles and data frames is the way tibbles print. They are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.

  • View(tibble): open an interactive scrollable and filterable view
  • print(tibble, width = Inf): show all columns
  • glimpse(tibble)
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…


dplyr basics

Common rules of dplyr

  • The first argument is always a data frame
  • The subsequent arguments typically describe which columns to operate on, using the variable names
  • The output is always a new data frame

The pipe operator |>

  • x |> f(y) is equivalent to f(x, y)
  • x |> f(y) |> g(z) is equivalent to g(f(x, y), z)
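
The equivalence between a piped call and a nested call can be checked directly. This is a small sketch on the flights data; the object names `piped` and `nested` are chosen for illustration.

```r
library(dplyr)
library(nycflights13)

# Piped version: read from left to right, top to bottom
piped <- flights |>
  filter(dest == "IAH") |>
  arrange(dep_delay)

# Nested version: the same calls, read from the inside out
nested <- arrange(filter(flights, dest == "IAH"), dep_delay)

identical(piped, nested)
```

Both forms compute the same result; the pipe mainly makes a sequence of verbs easier to read.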

flights |> 
  filter(dest == 'IAH') |> 
  group_by(year, month, day) |> 
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
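
The message above appears because summarize() removes only the last level of grouping by default. One way to make the behavior explicit is to set the .groups argument, sketched here with .groups = "drop"; the object name `daily_iah` is chosen for illustration.

```r
library(dplyr)
library(nycflights13)

daily_iah <- flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble and silence the message
  )

# The result is no longer grouped
is_grouped_df(daily_iah)
```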

dplyr’s verbs are organized into four groups based on what they operate on:

  • rows
  • columns
  • groups
  • tables




2. Rows

The most important verbs that operate on rows of a dataset are

  • filter()
  • arrange()
  • distinct()


filter()

filter() allows us to keep rows based on the values of the columns. When we run filter(), dplyr executes the filtering operation, creating a new data frame. It doesn’t modify the existing dataset. So if we want to save the result, we must use the assignment operator <-.

arguments are:

  • data frame
  • conditions
# departed more than 120 minutes late
flights |> 
  filter(dep_delay > 120)

We can also use <, <=, >, >=, ==, and !=, and combine conditions with & (and) or | (or). There is a useful shortcut when we are combining | and ==: %in%, which keeps rows where the variable equals one of the values on the right.

# flights that departed on January 1
flights |> 
  filter(month == 1 & day == 1)
# flights that departed in January or February
flights |> 
  filter(month == 1 | month == 2)
flights |> 
  filter(month %in% c(1, 2))
jan1 <- flights |> 
  filter(month == 1 & day == 1)
jan1
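
The following sketch checks two of the points above: that the | form and the %in% form select the same rows, and that filter() leaves flights itself untouched. The object names are chosen for illustration.

```r
library(dplyr)
library(nycflights13)

with_or <- flights |> filter(month == 1 | month == 2)
with_in <- flights |> filter(month %in% c(1, 2))

# Both conditions keep exactly the same rows
identical(with_or, with_in)

# filter() returned new data frames; flights still has all of its rows
nrow(with_in) < nrow(flights)
```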


arrange()

arrange() changes the order of the rows based on the values of the columns. If we provide more than one column name, each additional column is used to break ties in the values of the preceding columns. Ascending order is the default; to sort in descending order, wrap the column in desc().

arguments are:

  • data frame
  • set of columns
# sort by earliest departure (date, then departure time)
flights |> 
  arrange(year, month, day, dep_time)
# sort by longest departure delay first
flights |> 
  arrange(desc(dep_delay))
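
Tie-breaking is easier to see on a tiny example. The tibble `df` below is hypothetical toy data, not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data with ties in column g
df <- tibble(
  g = c("b", "a", "a", "b"),
  v = c(1, 3, 2, 4)
)

# Sort by g ascending; within equal g, sort by v descending
sorted <- df |>
  arrange(g, desc(v))
sorted
```

The rows with g equal to "a" come first, and within each value of g the larger v comes first.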


distinct()

distinct() finds all the unique rows in a dataset. Most of the time, however, we want the distinct combinations of some variables, so we can optionally supply column names. To keep the other columns when filtering for unique rows, use .keep_all = TRUE; distinct() then keeps the first occurrence of each combination and discards the rest.

# remove duplicate rows
flights |> 
  distinct()
# find all unique origin and destination pairs
flights |> 
  distinct(origin, dest)
flights |> 
  distinct(origin, dest, .keep_all = TRUE)
# count(): find the number of occurrences
# sort = TRUE: arrange them in descending order of number of occurrences
flights |> 
  count(origin, dest, sort = TRUE)
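
The effect of .keep_all is easiest to see on a few rows. The tibble `trips` below is hypothetical toy data that borrows the column names of flights.

```r
library(dplyr)

# Hypothetical toy data: the first two rows share an origin/dest pair
trips <- tibble(
  origin = c("JFK", "JFK", "LGA"),
  dest   = c("MIA", "MIA", "ATL"),
  flight = c(1, 2, 3)
)

# Only the columns named in distinct() are returned
pairs_only <- trips |> distinct(origin, dest)
pairs_only

# With .keep_all = TRUE the first row of each pair is kept in full
first_rows <- trips |> distinct(origin, dest, .keep_all = TRUE)
first_rows
```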




3. Columns

There are four important verbs that affect the columns.

  • mutate()
  • select()
  • rename()
  • relocate()


mutate()

The job of mutate() is to add new columns that are calculated from the existing columns.

By default, mutate() adds new columns on the right-hand side of the dataset. The .before argument adds them to the left-hand side instead, and .after places them after a given column. Both .before and .after accept a variable name as well as a position.

Alternatively, we can control which variables are kept with the .keep argument. For example, .keep = 'used' keeps only the columns that were involved in or created by the mutate() step.

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1
  )
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day
  )
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = 'used'
  )
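
The .keep argument accepts 'all' (the default), 'used', 'unused', and 'none'. The sketch below compares the resulting column names on a hypothetical three-column tibble `df`.

```r
library(dplyr)

# Hypothetical toy data
df <- tibble(a = 1:3, b = 4:6, c = 7:9)

# all: keep every column and add the new one
keep_all <- names(mutate(df, s = a + b, .keep = "all"))
# used: keep only the columns used to compute s, plus s
keep_used <- names(mutate(df, s = a + b, .keep = "used"))
# unused: keep only the columns not used to compute s, plus s
keep_unused <- names(mutate(df, s = a + b, .keep = "unused"))
# none: keep only the new column
keep_none <- names(mutate(df, s = a + b, .keep = "none"))

list(all = keep_all, used = keep_used, unused = keep_unused, none = keep_none)
```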


select()

select() allows us to rapidly zoom in on a useful subset using operations based on the names of the variables.

  • select columns by name
flights |> 
  select(year, month, day)
  • select all columns between year and day
flights |> 
  select(year:day)
  • select all columns except those from year to day
# can also use - instead of !
flights |> 
  select(!year:day)
  • select all columns that are characters
flights |> 
  select(where(is.character))

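There are a number of helper functions we can use within select():

  • starts_with('abc'): matches names that begin with 'abc'
  • ends_with('xyz'): matches names that end with 'xyz'
  • contains('ijk'): matches names that contain 'ijk'
  • num_range('x', 1:3): matches x1, x2, and x3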
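
A brief sketch of these helpers applied to flights. Since flights has no numbered columns, num_range() is shown on a hypothetical one-row tibble `numbered`.

```r
library(dplyr)
library(nycflights13)

# Columns whose names begin with "dep"
dep_cols <- flights |> select(starts_with("dep"))
names(dep_cols)

# Columns whose names end with "delay"
delay_cols <- flights |> select(ends_with("delay"))
names(delay_cols)

# Columns whose names contain "time" anywhere
time_cols <- flights |> select(contains("time"))
names(time_cols)

# Hypothetical toy data with numbered columns
numbered <- tibble(x1 = 1, x2 = 2, x3 = 3, y = 4)
range_cols <- numbered |> select(num_range("x", 1:2))
names(range_cols)
```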

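We can rename variables as we select() them by using =. The new name appears on the left-hand side of the =, and the old variable appears on the right-hand side.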

flights |> 
  select(tail_num = tailnum)


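rename()

rename() keeps all existing variables and changes only the names we specify, using new_name = old_name syntax.
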
flights |> 
  rename(tail_num = tailnum)
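
The difference between renaming inside select() and using rename() shows up in the number of columns that come back. A small sketch, with object names chosen for illustration:

```r
library(dplyr)
library(nycflights13)

# select() with a rename keeps only the selected column
selected <- flights |> select(tail_num = tailnum)
ncol(selected)

# rename() keeps every column and changes only the name
renamed <- flights |> rename(tail_num = tailnum)
ncol(renamed)

# The old name is gone and the new one is in its place
c("tailnum", "tail_num") %in% names(renamed)
```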


relocate()

Use relocate() to move variables around. By default relocate() moves variables to the front. We can also specify where to put them using .before and .after arguments just like in mutate().

flights |> 
  relocate(time_hour, air_time)
flights |> 
  relocate(year:dep_time, .after = time_hour)
flights |> 
  relocate(starts_with('arr'), .before = dep_time)